### Setup
%matplotlib inline
# %load_ext pretty_jupyter

# should enable plotting without explicit call .show()

# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pandas_profiling import ProfileReport
from matplotlib.colors import LogNorm
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from collections import Counter

# classes for special types
from pandas.api.types import CategoricalDtype

# Apply the default theme
sns.set_theme()

Introduction

In this report, we're going to predict how many goals will NHL players score next season. The prediction will be based on various player's data from past seasons.

Dataset overview

The provided dataset consists of 2 csv files:

nhl-teams.csv:

This file contains names of NHL teams together with their short name. There are 35 rows (1 row = 1 team) and only 2 columns. The structure is as follows:

# Reading and inspecting data
df = pd.read_csv("data/nhl-teams.csv")
df.head(5)
team team_full
0 ANA Anaheim Ducks
1 ARI Arizona Coyottes
2 ATL Atlanta Thrashers
3 BOS Boston Bruins
4 BUF Buffalo Sabres

nhl-player-data.csv

This file contains various data about NHL players (goals, points per season, average time on ice, age, etc.) for seasons 2004-2018. Each row contains data for the given player per season, i.e. if the player has played multiple seasons in NHL, there will be multiple rows containing his data (1 row per each season).

Size of the dataset is about 1.34 MB
There are 12328 rows and 32 columns

Structure is as follows:

df = pd.read_csv("data/nhl-player-data.csv")
df.head(5)
Rk Player Nick Age Pos Tm GP G A PTS ... TOI ATOI BLK HIT FOW FOL FO_percent HART Votes Season
0 1 Connor McDavid mcdavco01 20 C EDM 82 30 70 100 ... 1733 21.133333 29.0 34 348.0 458.0 43.2 1 1604 2017
1 2 Sidney Crosby crosbsi01 29 C PIT 75 44 45 89 ... 1491 19.883333 27.0 80 842.0 906.0 48.2 0 1104 2017
2 3 Patrick Kane kanepa01 28 RW CHI 82 34 55 89 ... 1754 21.400000 15.0 28 7.0 44.0 13.7 0 206 2017
3 4 Nicklas Backstrom backsni02 29 C WSH 82 23 63 86 ... 1497 18.266667 33.0 45 685.0 648.0 51.4 0 60 2017
4 5 Nikita Kucherov kucheni01 23 RW TBL 74 40 45 85 ... 1438 19.433333 20.0 30 0.0 0.0 0.0 0 119 2017

5 rows × 32 columns

Types of the columns are displayed below:

df.dtypes
Rk              int64
Player         object
Nick           object
Age             int64
Pos            object
Tm             object
GP              int64
G               int64
A               int64
PTS             int64
plusminus       int64
PIM             int64
PS            float64
EV              int64
PP              int64
SH              int64
GW              int64
EV.1            int64
PP.1            int64
SH.1            int64
S               int64
S_percent     float64
TOI             int64
ATOI          float64
BLK           float64
HIT             int64
FOW           float64
FOL           float64
FO_percent    float64
HART            int64
Votes           int64
Season          int64
dtype: object

As we mentioned before, we're mostly interested in predicting goals per season. The distribution of goals per season in our dataset looks like this:

g = sns.histplot(data=df, x="G", binwidth=5)
plt.xlabel("Goals per season")
plt.ylabel("Count of players")

plt.show()

Here are the descriptive statistics:

df["G"].describe()
count    12328.000000
mean         7.484263
std          8.846936
min          0.000000
25%          1.000000
50%          4.000000
75%         11.000000
max         65.000000
Name: G, dtype: float64

We can see, that the average player scores about 7.5 goals per season, the maximum is 65 goals per season, and over 20% of players didn't score any goal in a season.

The graph below shows distribution of players by season.

g = sns.histplot(data=df, x="Season", discrete=True)

We can see that the data are evenly distributed between seasons, however data for 2005 are missing because the season was cancelled.

Missing values

In total there are 438 missing cells in the dataset, which is about 0.1% of all cells. These are mostly concentrated in two columns:

  • FO_percent (faceoff win percentage)
  • S_percent (shooting percentage)

In my opinion, this is most likely caused by the fact that the given players didn't take any faceoffs during the season (e.g. defensemen typically don't take faceoffs), or respectively didn't shoot at the goal (maybe the given player played just 1 game in whole season). Both cases will result in division by zero.

There is also 1 missing value in each of the following columns: BLK, FOW and FOL.

Missing values aren't denoted by any special strings (such as "None" or "Null"), there are just 2 consecutive commas in the given row.

There is one row with player whose age is 0, which is definitely an error in the dataset.

Exploratory analysis

We're going to explore relations in the dataset.

Note: Players who scored 0 goals in a given season are excluded from all graphs.

Goals & player's age

Firstly, we're going to explore whether the player's age correlates with the goals scored in any way.

df = df.loc[df["G"] > 0]
g = sns.JointGrid(data=df, x="Age",y="G")
g1 = g.plot(sns.histplot, sns.histplot, discrete=True)
g2 = g1.set_axis_labels("Age", "Goals per season")

From the graph above, we see that suprisingly, there is no visible correlation between these 2 variables. The best results belong to players aged 20-25 results, however these players are also the most represented group in the dataset.

Goals & shooting percentage

In this part, we'll explore whether the goals scored depend on player's shooting percentage.
Here is a graph showing relation of these 2 variables.

df_scored = df.loc[df["S_percent"] > 0]
g = sns.JointGrid(data=df_scored, x="S_percent",y="G")
g1 = g.plot(sns.histplot, sns.histplot, binwidth=1)
g2 = g1.set_axis_labels("Shooting percentage", "Goals per season")

From the graph, it seems that the number of goals directly depends on shooting percentage. So, we can formulate the following hypothesis: The better shoting percentage, the more goals scored.

Of course, this doesn't apply to all data (e.g. there are several players with shooting percentage >= 50% and none of them scored more than 5 goals), however on average it seems to be true.

Goals & average time on ice

In this section, we're going to explore relation between goals scored and average time on ice.
Here is the graph showing relation between these 2 variables:

g = sns.JointGrid(data=df, x="ATOI",y="G")
g1 = g.plot(sns.histplot, sns.histplot, binwidth=1)
g2 = g1.set_axis_labels("Average time on ice", "Goals per season")

We can observe that the graph contains 2 clusters (one goes from average time on ice = 10 up right and second one goes from the same position to the right).

In my opinion, the first cluster contains forwards and the second one contains defencemen, who often spend a lot of time on ice, however don't score as much as forwards. There is also some strange data at the left showing players who spend on average < 5 minutes on ice, but score many goals (over 10 or even 20 goals per season).

The following graph shows the same relation but this time only for forwards:

df_forwards = df.loc[df["Pos"].isin(["C","LW","RW"])]

g = sns.JointGrid(data=df_forwards, x="ATOI",y="G")
g1 = g.plot(sns.histplot, sns.histplot, binwidth=1)
g2 = g1.set_axis_labels("Average time on ice", "Goals per season")

We can see that the first cluster indeed mostly consists of forwards.
The graph also suggests that for forwards, we can formulate the following hypothesis: Number of scored goals directly depends on average time on ice (the higher ice time, the more goals).

The same graph but for defencemen, looks like this:

df_forwards = df.loc[df["Pos"] == "D"]

g = sns.JointGrid(data=df_forwards, x="ATOI",y="G")
g1 = g.plot(sns.histplot, sns.histplot, binwidth=1)
g2 = g1.set_axis_labels("Average time on ice", "Goals per season")

We can observe that most of the strange data (players who spend on average < 5 minutes on ice, but score many goals) from the graph above belong to defencemen.

After looking at these suspicious data, we see that the average time on ice is miscalculated. If we divide total time on ice (TOI) by games played (GP), we should get average time on ice (ATOI). However, for these data ATOI is way smaller (mostly in the range 0-5) than the real value (see ATOI_real column in the table below). Therefore, it's an error in the data (for further data processing, we may recalculate ATOI values).

df_sus = df.loc[(df["G"] > 10) & (df["ATOI"] <= 2) & (df["Pos"] == "D"),["Player","Season","Pos","GP","G","TOI","ATOI"]]
df_sus["ATOI_real"] = df_sus["TOI"] / df_sus["GP"]
df_sus.head(5)
Player Season Pos GP G TOI ATOI ATOI_real
8 Brent Burns 2017 D 82 29 2039 0.866667 24.865854
15 Victor Hedman 2017 D 79 16 1936 0.500000 24.506329
107 Roman Josi 2017 D 72 12 1805 1.066667 25.069444
110 Alex Pietrangelo 2017 D 80 14 2023 1.283333 25.287500
148 Shea Weber 2017 D 78 17 1955 1.066667 25.064103

After exluding these error data, we see that for defencemen, there is a similar hypothesis: Higher average time on ice typically means more scored goals.

Data preparation

Cleaning

As mentioned before, there are some wrong data in the dataset. I executed the following steps in order to clean it:

  • removed the player, whose age is 0
  • removed the players that had missing values in columns BLK, FOW and FOL (3 rows in total)
  • replaced missing values in S_percent column by zeros. I checked that all of the players, whose S_percent value is missing, scored zero goals, therefore setting it to zero is correct.
  • similarly, I replaced missing FO_percent values by zeros (again, I checked that these players didn't win any faceoff).
  • recalculated all ATOI values (as mentioned in the previous section, some of the values are wrong).
df = pd.read_csv("data/nhl-player-data.csv")
df = df.loc[df["BLK"].notna() & (df["FOW"].notna()) & (df["FOL"].notna()) & (df["Age"] > 0)]
df["ATOI"] = df["TOI"] / df["GP"]

df["S_percent"] = df["S_percent"].fillna(0)
df["FO_percent"] = df["FO_percent"].fillna(0)

Transformation

For predicting number of scored goals in the next season, I used only players, who played at least 3 consecutive seasons in NHL. If a player has played more seasons in NHL, he will be present in multiple rows of training data (one row for each triple of consecutive seasons).

I used the limit of 3 seasons, because more than half of the players satisfies the condition (and therefore will be included in training data) and it's showing some progress of the player. If the limit was lower, there will probably be too little information (only 1 season in training data).

I extracted the following features from data:

  • goals scored
  • shooting percentage
  • average time on ice

for first and second season. (The number of scored goals in the last (i.e. 3rd) season will be used as a target.)

In addition to shooting percentage and ATOI mentioned in hypotheses, I also added goals scored in the seasons as features, because in my opinion it has a big impact on the prediciton for next season. Furthermore, I also added one-hot encoded position of the player into features, because it affects the prediction quite a lot (as shown in the relation between ATOI and goals).

The transformed data look like this:

TRAIN_SEASONS = 2

def most_common(lst):
    return max(set(lst), key=lst.count)

def get_four_consecutives(numbers):
    numbers.sort()
    if len(numbers) < TRAIN_SEASONS + 1:
        return None
        
    consecutives, has_results = [], False
    
    for i in range(len(numbers) - TRAIN_SEASONS):
        y1, y2, y3 = numbers[i:i + TRAIN_SEASONS + 1]
        if y1 + 1 == y2 and y1 + 2 == y3:
            consecutives += [(y1, y2, y3)]
            has_results = True

    return consecutives if has_results else None


def normalize(row):
    seasons, last_season = row["Con_seasons"][:-1], row["Con_seasons"][-1]
    indexes = [row["Season"].index(s) for s in seasons]
    target_index = row["Season"].index(last_season)

    row["target"] = row["G"][target_index]
    row["Pos"] = most_common(row["Pos"])

    for col_index, i in enumerate(indexes):
        row[f"G_{col_index}"] = row["G"][i]
        row[f"S_percent_{col_index}"] = row["S_percent"][i]
        row[f"ATOI_{col_index}"] = row["ATOI"][i]
    
    return row

# Group the data by player
players_agg = df.groupby(["Player"]).aggregate(lambda x: list(x))
players_agg["Con_seasons"] = players_agg["Season"].apply(get_four_consecutives)
players_agg = players_agg.loc[players_agg["Con_seasons"].notna()]
players_agg = players_agg.explode("Con_seasons")

# create features for each season
players_agg = players_agg.apply(normalize, axis=1)

# select the right columns
columns = ["Pos"]
for i in range(TRAIN_SEASONS):
    columns += [f"G_{i}", f"S_percent_{i}", f"ATOI_{i}"]

targets = players_agg["target"]
players_agg = players_agg[columns]

# Get one hot encoding of Position
one_hot = pd.get_dummies(players_agg['Pos'], prefix='Pos')
players_agg = players_agg.drop('Pos', axis = 1)
players_agg = pd.concat([players_agg, one_hot], axis=1)

players_agg.sample(n=10)
G_0 S_percent_0 ATOI_0 G_1 S_percent_1 ATOI_1 Pos_C Pos_D Pos_LW Pos_RW
Player
Zach Boychuk 3 8.6 10.612903 1 6.7 10.181818 1 0 0 0
Mathew Dumba 8 9.3 15.017241 1 8.3 12.461538 0 1 0 0
Vladimir Tarasenko 37 14.0 17.623377 21 15.4 15.171875 0 0 0 1
Ed Jovanovski 12 5.0 22.550000 11 8.1 23.148148 0 1 0 0
Derek MacKenzie 9 12.9 11.267606 3 9.1 10.348837 1 0 0 0
Kris Newbury 0 0.0 5.000000 0 0.0 7.833333 1 0 0 0
Danny Richmond 0 0.0 9.428571 0 0.0 12.318182 0 1 0 0
Benoit Pouliot 5 14.7 11.837838 2 20.0 8.818182 0 0 1 0
Mike Hoffman 0 0.0 12.333333 0 0.0 9.000000 1 0 0 0
Alex Petrovic 2 3.8 16.954545 0 0.0 16.272727 0 1 0 0

Modeling

The model is based on Linear regression with cross-validation.

I implemented the cross-validation using GridSearchCV model from sklearn library with the following parameters:

  • number of folds: 5 (default value)
  • Regression parameters for GridSearch: {'positive': [True, False]} . Thus, GridSearch will search over specified parameter values for the best estimator, and use it.

I used 90% of the data for training and the remaining 10% for testing.

Parameters of the model (weights of the features) are displayed below:

X_train, X_test, y_train, y_test = train_test_split(players_agg, targets, train_size=0.9)

parameters = {'positive': [True, False]}

model = GridSearchCV(LinearRegression(), parameters, scoring="neg_root_mean_squared_error").fit(X_train, y_train)
feature_weights = pd.DataFrame({'feature' : []})
feature_weights["feature"] = players_agg.columns
feature_weights["weight"] = model.best_estimator_.coef_ 

feature_weights
feature weight
0 G_0 0.248750
1 S_percent_0 -0.045150
2 ATOI_0 0.056741
3 G_1 0.313462
4 S_percent_1 -0.077204
5 ATOI_1 0.558961
6 Pos_C 1.130136
7 Pos_D -5.430978
8 Pos_LW 2.173465
9 Pos_RW 2.127377

We see that our model assigned the highest weights to Pos_LW and Pos_RW features, while Pos_D has the smallest (negative) weight. As mentioned in hypotheses, position of the player definitely affects the number of scored goals (forwards simply score more than defencemen), so this makes sense. However, I expected that goals scored in previous seasons (features G_0 and G_1) would have the highest weights (i.e. the biggest impact on prediction).

The model set very small weights (for shooting percentage even negative) to S_percent_0, ATOI_0 and S_percent_1 features, which is surprising.

On the other hand, weight of ATOI_1 feature is much higher, which makes sense in my opinion, because the more recent data should have bigger impact on prediction.

Evaluation

For evaluation I used root-mean-square error score.

I evaluated the model on testing data, the resulting value is shown below:

predictions = model.predict(X_test)

mse = mean_squared_error(y_test, predictions, squared=False)
mse
5.893971181447213

RMSE of the model is ~ 6, which means that on average, the predictions differs from the target value by 6 goals, which is in my opinion quite decent result.

Below, there is a graph showing comparison between predicted values for testing data and real targets:

results_df = pd.DataFrame({'target' : []})
results_df["target"] = y_test
results_df["prediction"] = predictions

g = sns.scatterplot(data=results_df, x="target", y="prediction")

We see that the model typically over-estimates the number of scored goals for small targets. On the other hand, for targets > 20, the predictions are typically under-estimated.

The following graph shows the distribution of difference target - prediction in test data.

differences_df = pd.DataFrame({'difference' : []})
differences_df['difference'] = y_test - predictions
g = sns.histplot(data=differences_df, x="difference", binwidth=2, binrange=(-20,30))

We can see that the data are roughly evenly distributed around 0.

Conclusions

The goal of this report was to predict how many goals will NHL players score next season, based on various data from past 2 seasons.

I created a Linear regression model with 5-fold cross-validation and used RMSE for its evaluation. RMSE on testing data was about 6, i.e. on average, the predictions differs from the target by 6 goals.